Risk factors associated with post-acute sequelae of SARS-CoV-2: an N3C and NIH RECOVER study

Background: More than one-third of individuals experience post-acute sequelae of SARS-CoV-2 infection (PASC, which includes long-COVID). The objective of this study was to identify risk factors associated with PASC/long-COVID diagnosis.
Methods: This was a retrospective case–control study including 31 health systems in the United States from the National COVID Cohort Collaborative (N3C). A total of 8,325 individuals with PASC (defined by the presence of the International Classification of Diseases, version 10 code U09.9 or a long-COVID clinic visit) were matched to 41,625 controls within the same health system and with a COVID index date within ± 45 days of the corresponding case's earliest COVID index date. Measurements of risk factors included demographics, comorbidities, treatment, and acute characteristics related to COVID-19. Multivariable logistic regression, random forest, and XGBoost were used to determine the associations between risk factors and PASC.
Results: Among 8,325 individuals with PASC, the majority were > 50 years of age (56.6%), female (62.8%), and non-Hispanic White (68.6%). In logistic regression, middle-age categories (40 to 69 years; OR ranging from 2.32 to 2.58), female sex (OR 1.4, 95% CI 1.33–1.48), hospitalization associated with COVID-19 (OR 3.8, 95% CI 3.05–4.73), long (8–30 days, OR 1.69, 95% CI 1.31–2.17) or extended hospital stay (30+ days, OR 3.38, 95% CI 2.45–4.67), receipt of mechanical ventilation (OR 1.44, 95% CI 1.18–1.74), and several comorbidities including depression (OR 1.50, 95% CI 1.40–1.60), chronic lung disease (OR 1.63, 95% CI 1.53–1.74), and obesity (OR 1.23, 95% CI 1.16–1.3) were associated with increased likelihood of PASC diagnosis or care at a long-COVID clinic. Characteristics associated with a lower likelihood of PASC diagnosis or care at a long-COVID clinic included younger age (18 to 29 years), male sex, non-Hispanic Black race, and comorbidities such as substance abuse, cardiomyopathy, psychosis, and dementia. More doctors per capita in the county of residence was associated with an increased likelihood of PASC diagnosis or care at a long-COVID clinic. Our findings were consistent in sensitivity analyses using a variety of analytic techniques and approaches to select controls.
Conclusions: This national study identified important risk factors for PASC diagnosis such as middle age, severe COVID-19 disease, and specific comorbidities. Further clinical and epidemiological research is needed to better understand underlying mechanisms and the potential role of vaccines and therapeutics in altering PASC course.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12889-023-16916-w.


Data
N3C structure, access, and analytic capabilities have been described in detail previously [20]. The N3C collects information from single- and multi-hospital health systems across the U.S. and stores data in a central location, the N3C data enclave. As of April 14, 2022, it contained data from 72 health systems and > 4.9 million individuals with COVID-19. For this study, we used a limited data set, which contains deidentified data, five-digit patient ZIP codes, and exact dates of COVID-19 diagnoses and service use (eMethods) [21].

Study design and cohort (Fig. 1)
The study cohort is based on 4,559,795 potentially eligible patients from 59 health systems who were diagnosed with SARS-CoV-2 infection or had a positive polymerase chain reaction (PCR) or antigen (AG) lab test for SARS-CoV-2. Of these, 3,884,477 were adults (> 18 years of age). Individuals may have multiple SARS-CoV-2 infections, so we considered the earliest documented date of positive test or diagnosis as the COVID index date. An index date was required to determine the relative timing of infection and long-COVID diagnosis (International Classification of Diseases, Tenth Revision, Clinical Modification [ICD-10-CM] code U09.9) or long-COVID clinic visit. Not all health systems currently use U09.9 or have clinics dedicated to long-COVID treatment [22]. Therefore, we limited our cohort to patients from the 31 health systems with at least one documented long-COVID case using U09.9 or a long-COVID clinic visit between Oct 1, 2021 and Feb 28, 2022 (n = 1,490,823). We excluded patients who died within 45 days of the index date because by definition they would not be at risk of developing PASC (n = 1,467,804). Finally, so that patients had an adequate observation period after acute infection, we required their index acute infection date to fall between March 1, 2020 and December 1, 2021 (n = 1,062,661). In this way, we employed a restrictive case definition to maximize the likelihood of selecting true cases of PASC from this base cohort.
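The index-date rule and the last two exclusion steps can be illustrated with a minimal Python sketch. It is not the study code: the function names are hypothetical, the age and health-system filters are omitted, and treating a death on exactly day 45 as "within 45 days" is an assumption, since the text does not state the boundary.

```python
from datetime import date, timedelta

# Study window and survival requirement as stated in the text above.
WINDOW_START = date(2020, 3, 1)
WINDOW_END = date(2021, 12, 1)
MIN_SURVIVAL = timedelta(days=45)


def covid_index_date(positive_dates):
    """Earliest documented positive test or diagnosis date."""
    return min(positive_dates)


def is_eligible(positive_dates, death_date=None):
    """Apply the survival and index-window criteria to one patient.

    Assumption: a death on day 45 counts as 'within 45 days'.
    """
    index = covid_index_date(positive_dates)
    if death_date is not None and death_date - index <= MIN_SURVIVAL:
        return False  # not at risk of developing PASC by definition
    return WINDOW_START <= index <= WINDOW_END
```

A patient with several documented infections is therefore evaluated only on the first one, which is what allows the later PASC diagnosis or clinic visit to be timed relative to a single date.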

Case and control selection
In our primary analyses, we defined cases as those with a documented U09.9 diagnosis or a documented long-COVID clinic visit flag in the N3C (n = 8,325). As a sensitivity analysis, we also defined cases as 1) U09.9 only (n = 7,512) or 2) long-COVID clinic visits only (n = 1,241).
Controls were challenging to select because individuals may have had PASC but not received a diagnosis. We used three methods to identify controls, i.e., individuals without PASC. Our base analysis allowed any patient who was not a case to be considered as a possible matched control (unrestricted controls). Additionally, for two control cohorts, we applied our previously developed computable phenotype (CP) model for long-COVID to refine our control patient pool [23]. We applied the CP model to the 1,054,336 non-cases (1,062,661 - 8,325) to generate a predicted probability for U09.9 diagnosis or long-COVID clinic visit. The model generated a predicted probability of PASC for 716,203 individuals, who became eligible for matched control selection (eMethods). In each of the above three methods, we randomly matched 1 case to 5 controls without replacement from the same health system and with a COVID index date within ± 45 days of the corresponding case's earliest COVID index date. We matched 8,325 cases to 41,625 controls in the "unrestricted" method, and 8,322 cases to 41,610 controls in each of the "restricted" and "more restricted" methods.
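The 1:5 matching described above can be sketched as follows. This is an illustrative sketch only: the record keys (`id`, `site`, `index_date`), the function name, and the greedy case-by-case order are assumptions, as the text specifies only random matching without replacement within health system and ± 45 days. Cases with fewer than five available controls are dropped, consistent with the three cases lost in the restricted samples.

```python
import random
from datetime import date


def match_controls(cases, controls, ratio=5, window_days=45, seed=0):
    """Match each case to `ratio` controls without replacement.

    Records are dicts with hypothetical keys 'id', 'site', 'index_date'.
    Returns {case id: [control ids]}; unmatched cases are omitted.
    """
    rng = random.Random(seed)
    used = set()  # controls already assigned to an earlier case
    matches = {}
    for case in cases:
        pool = [
            c for c in controls
            if c["id"] not in used
            and c["site"] == case["site"]
            and abs((c["index_date"] - case["index_date"]).days) <= window_days
        ]
        if len(pool) < ratio:
            continue  # fewer than `ratio` available controls
        chosen = rng.sample(pool, ratio)
        used.update(c["id"] for c in chosen)
        matches[case["id"]] = [c["id"] for c in chosen]
    return matches
```

Because controls are removed from the pool once used, the order in which cases are processed can affect which cases find a full set of controls; the sketch does not attempt to optimize this.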

Risk factors
We used existing literature [10][11][12], clinical expertise, and availability of information in the N3C to identify potential risk factors for PASC that are identifiable in EHR data (Table 1 and Supplemental eTable 1 for full list). We used information before the COVID-19 diagnosis date to identify an individual's age, gender, race/ethnicity (non-Hispanic White, non-Hispanic Black, Hispanic, Asian, other, and unknown), obesity (a diagnosis of obesity or a body mass index [BMI] ≥ 30), smoking status, substance abuse status, and comorbidities. We included 17 common comorbidities used in the Charlson Comorbidity Index [24] and additional comorbidities and treatments (e.g., use of corticosteroids) which are considered risk factors for severe acute COVID-19.

Table 1 Cohort characteristics for PASC cases defined by U09.9 or long-COVID clinic visit and three sets of controls. a Only captured for individuals hospitalized for COVID-19. b The restricted samples (Methods 2 and 3) lose 3 cases due to not having sufficient controls (< 5 available controls).

For SDoH, we used county-level variables from the Sharecare-Boston University School of Public Health Social Determinants of Health dataset [26]. Specifically, we used percent of households with income below poverty, percent of residents with a college degree, percent of residents aged 19-64 with public insurance, and physicians per 1000 residents [26]. These are all included as tertiles in the analyses.
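As an illustration of how a county-level variable can be converted to tertiles, a minimal sketch follows. The cut-point convention (values at positions n/3 and 2n/3 of the sorted list, with ties assigned upward) and the function names are assumptions; the text does not describe how tertile boundaries were computed.

```python
def tertile_cutoffs(values):
    """Return two cut points that split the sorted values into thirds."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]


def tertile(value, cutoffs):
    """Map a value to tertile 1 (lowest), 2, or 3 (highest)."""
    low, high = cutoffs
    if value < low:
        return 1
    if value < high:
        return 2
    return 3
```

For example, with nine counties whose physician densities are 1 through 9 per 1000 residents, the cut points are 4 and 7, so counties 1-3, 4-6, and 7-9 fall in the first, second, and third tertiles, respectively.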

Statistical analysis
We used descriptive statistics to compare PASC cases with the three non-PASC control cohorts, including counts and percentages for categorical variables and means and standard deviation for continuous variables.
We used multivariable logistic regression to determine associations between risk factors and PASC. We constructed three separate logistic regression models for the three cohorts of matched cases and controls. All patient characteristics, with and without SDoH, were included as independent variables in the three models. We reported odds ratios (OR) and 95% confidence intervals (CI) for risk factors.
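For readers less familiar with how odds ratios follow from a fitted logistic regression, the conversion can be sketched as below. The sketch assumes Wald-type intervals (coefficient ± 1.96 standard errors on the log-odds scale); the text does not state which interval method was used, and the function name is hypothetical.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Convert a log-odds coefficient and its standard error into
    (odds ratio, lower bound, upper bound) using a Wald interval."""
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )
```

As an illustration only, a coefficient of ln(1.4) with a standard error of about 0.0273 (a value back-calculated here for the example, not one reported in the study) yields an OR of 1.4 with an interval of roughly 1.33 to 1.48.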
In addition to logistic regression, we used two machine learning methods, random forest (RF) [27] and XGBoost, to identify influential risk factors for developing PASC [28]. Machine learning methods provide the ability to investigate massive datasets and reveal patterns within data without relying on a priori assumptions such as pre-specified statistical interactions, specific variable associations, or linearity in variable relationships [29]. We conducted feature importance analysis for both RF and XGBoost models [30], and display SHAP (SHapley Additive exPlanations) plots [31] from the XGBoost models (eMethods). All models included an indicator variable for missing race/ethnicity. All analyses were conducted using Python 3.6.
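The table notes in this article name sklearn.inspection.permutation_importance() as the basis for ranking features in the RF and XGBoost models. The idea behind that technique can be sketched in plain Python as follows. This is a simplified stand-in, not the scikit-learn implementation: the `model` argument is a hypothetical callable that maps one feature row to a predicted label, and accuracy is used as the score purely for illustration.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled.

    X is a list of feature rows (lists); larger drops indicate features
    the model relies on more heavily.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature that the model ignores receives an importance of exactly zero, because shuffling it cannot change any prediction.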

Secondary and stratified analysis
For the unrestricted controls and PASC cases defined by U09.9 or a long-COVID visit (primary cohort), we performed planned secondary analysis by including SDoH variables in logistic regression and two machine learning models. We performed stratified analysis by hospitalization status to assess whether risk factors differed for these two groups (eMethods).

Sensitivity analyses
To check the robustness of our results, we examined risk factors using the matched case-control design separately for cases identified: (a) using U09.9 diagnosis code and (b) based on long-COVID clinic visits, each with five matched controls. We refit each of the three model types in the above six cohorts of PASC cases and matched controls.
The performance of the XGBoost and logistic regression models was similar (both AUC 0.73), closely followed by the RF model (AUC 0.69) (eTable 3). Risk factors for PASC identified by the XGBoost models had a similar direction to those identified by the logistic regression models (Table 2, eTable 4). However, the magnitude and order of importance of risk factors varied between XGBoost and logistic regression. For example, invasive mechanical ventilation was ranked 6 by XGBoost versus 21 by logistic regression.
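Because ranks differ between models, the feature tables in this article order features by mean rank across the models that included each variable. A minimal sketch of that calculation is shown below; the function name and input layout are hypothetical.

```python
def mean_ranks(model_ranks):
    """Average each feature's rank over the models that include it.

    model_ranks: {model name: {feature: rank}}, where rank 1 is the most
    important. Returns (feature, mean rank) pairs, best rank first.
    """
    collected = {}
    for ranks in model_ranks.values():
        for feature, rank in ranks.items():
            collected.setdefault(feature, []).append(rank)
    means = {f: sum(r) / len(r) for f, r in collected.items()}
    return sorted(means.items(), key=lambda item: item[1])
```

Using the example above, a feature ranked 6 by XGBoost and 21 by logistic regression would have a mean rank of 13.5 across those two models.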

Restricted controls
eTable 5 and eTable 6 show the importance of risk factors among less restrictive and more restrictive controls, respectively. For most patient characteristics, the direction and magnitude of the odds ratios were similar to the primary analysis (eTable 2). However, obesity was no longer significant when we used the less and more restrictive controls. Also, ECMO was associated with PASC when the more restrictive controls were used, but it was not a statistically significant factor when the unrestricted controls were used.

Secondary analysis including SDoH
We repeated our primary analysis (U09.9 or long-COVID clinic model, unrestricted control cohort) by adding SDoH variables (Fig. 2, eTable 7). The number of medical doctors per 1000 residents in the county of residence was associated with PASC, indicating that access to healthcare services increases the likelihood of diagnosis and/or treatment at a long-COVID clinic. Other SDoH factors were not associated with PASC in logistic regression but were important features in the machine learning models (eFigure 5, Table 3).

Stratified analysis by COVID-index hospitalization
To assess risk factors unique to less severe SARS-CoV-2 infections, we stratified the analysis by whether the patient was hospitalized at the time of the COVID-19 index date (eTables 8-13). For the hospitalized sample, the strongest risk factors across the LR, XGBoost, and RF models were possible markers of COVID-19 severity (e.g., ECMO, ED visit, mechanical ventilation) and obesity. Living in a community with higher education increased the likelihood of diagnosis or care at a long-COVID clinic (eFigure 4). For those not hospitalized at the COVID index date, the following pre-COVID risk factors differed from those for hospitalized patients: systemic corticosteroid use and a diagnosis of depression, peptic ulcer, or coronary artery disease. When we limited the analysis to patients not hospitalized at the COVID-19 index date, some SDoH factors were also strong predictors, including living in communities with lower poverty and higher education (eFigure 6, eFigure 7). Some risk factors were common to both the hospitalized and non-hospitalized samples, including middle age (40-69), chronic lung disease, and non-Hispanic White race/ethnicity (eFigure 6, eFigure 7).

Sensitivity analysis: other definitions of PASC
We have described the sensitivity analyses in detail in eResults. Overall, sensitivity analysis results based on only the U09.9 definition or only long-COVID clinic visits were similar to the primary analysis.

Discussion
In this first large-scale US study of risk factors for PASC diagnosis or long-COVID clinic visit, we found that middle age (40 to 69 years), female sex, severity of acute infection (e.g., hospitalization for COVID-19, long or extended hospital stay, treatment for acute COVID-19 during hospitalization), and several comorbidities including depression, chronic lung disease, obesity, and malignant cancer were associated with increased likelihood of PASC diagnosis or care at a long-COVID clinic. Risk factors associated with a lower likelihood of PASC diagnosis or care at a long-COVID clinic included younger age (18 to 29 years), male sex, non-Hispanic Black race, and comorbidities such as substance abuse, cardiomyopathy, psychosis, and dementia. We also found that a greater number of physicians per capita in the county of residence was associated with an increased likelihood of PASC diagnosis or care. Our findings were consistent in sensitivity analyses using a variety of approaches to select controls and several robust analytic techniques.
Our findings add to the growing body of evidence identifying and characterizing PASC risk factors. Although females were less likely to die or be hospitalized due to acute COVID-19 [32,33], they appear to have a greater risk of developing PASC. Our finding that there is a higher likelihood of PASC diagnosis among middle-aged individuals is consistent with a recent United Kingdom Office for National Statistics analysis, but is in contrast with another report that found that older individuals were at the highest risk for PASC [8,12]. Older adults are at greater risk of mortality from COVID-19, and older individuals may have died before developing PASC. Our analysis did not account for the competing risk of death while studying PASC risk factors. Risk factors such as chronic lung disease, rheumatologic disease, and obesity were associated with both hospitalization and death due to COVID-19 and also with increased risk of PASC diagnosis or care.
We previously established a machine learning phenotype [23] that used clinical features observed after COVID-19 infection to generate a probability for whether a patient currently has PASC. In contrast, the current analysis uses features selected from the acute phase of COVID-19 (such as pre-existing clinical comorbidities and hospitalization characteristics at the time of the initial infection) to assess risk factors for the later emergence of PASC as indicated by a U09.9 diagnosis or long-COVID clinic visit. It is possible that individuals with greater access to healthcare may be more likely to have a PASC diagnosis. We tried to control for this phenomenon by restricting to individuals who have at least one visit to a healthcare provider post-COVID in the CP model. The models in this analysis can be applied by clinicians to identify patients at risk for PASC while they are still in the acute phase of their infection and also to support targeted enrollment in clinical trials for preventing or treating PASC.

Table 2 Comparison of feature importance for PASC models defined by U09.9 or long-COVID clinic visit and unrestricted controls (comparing 8,325 cases with 41,625 controls; top 15 positive and negative features). This table shows the top 15 features associated with increased risk and the top 15 features associated with decreased risk. Complete models are shown in the Supplement. Unrestricted sample, U09.9 or long-COVID clinic visit target (see text). Grouped by median direction (increased/decreased) and ordered by mean rank. Model rank calculated based on sklearn.inspection.permutation_importance() (XGB/RF) or absolute ordered size of coefficient (LR). Mean rank is based on the rank of each model that had the variable in the model. Mint color indicates features associated with increased risk. Salmon color indicates features associated with decreased risk. An uncolored cell indicates that the feature was the reference group for the logistic regression model.
The association we found between more severe acute COVID-19 and increased likelihood of PASC is consistent with prior literature [34]. Individuals who were hospitalized for COVID-19 or received intensive treatment may have long-lasting effects on the brain, heart, lungs, and other organs [35][36][37][38][39]. Counterintuitively, we found that diabetes, a strong risk factor for worse outcomes after acute COVID-19, was associated with less likelihood of PASC diagnosis. Our previous work has demonstrated that glycemic control in patients with diabetes, as measured by pre-infection HbA1c levels, is an important risk factor for poor acute infection outcomes [40]. The level of granularity available in EHR data may not be sufficient to completely disentangle PASC risk associated with some comorbidities from PASC risk from SDoH and unmeasured biological features. We found that a pre-existing diagnosis of depression was associated with a higher risk of subsequent PASC. Interestingly, however, prior diagnoses of other mental health conditions (e.g., psychosis) were associated with lower risk. Comorbid substance abuse (also associated with lower likelihood of PASC diagnosis) with psychosis may explain some of this difference, as those with substance abuse disorders may have challenges accessing health care. Antidepressants and antipsychotics have differential immunomodulatory effects, which could also contribute to this observation. Another interesting finding is that patients with comorbidities such as cardiomyopathy, metastatic solid tumors, and liver disease, which made them vulnerable to worse outcomes after acute COVID-19, had a lower likelihood of PASC diagnosis. Although we cannot determine causality from this association, this finding may be hypothesis-generating.

Fig. 2 Forest plots from logistic regression for unrestricted controls with SDoH (PASC defined as U09.9 or long-COVID clinic visit)

Table 3 Comparison of feature importance for PASC models defined by U09.9 or long-COVID clinic visit and unrestricted controls with SDoH variables included (comparing 8,325 cases with 41,625 controls; top 15 positive and negative features). This table shows the top 15 features associated with increased risk and the top 15 features associated with decreased risk. Complete models are shown in the Supplement. Unrestricted sample, U09.9 or long-COVID clinic visit target (see text). Grouped by median direction (increased/decreased) and ordered by mean rank. Model rank calculated based on sklearn.inspection.permutation_importance() (XGB/RF) or absolute ordered size of coefficient (LR). Mean rank is based on the rank of each model that had the variable in the model. Mint color indicates features associated with increased risk. Salmon color indicates features associated with decreased risk. An uncolored cell indicates that the feature was the reference group for the logistic regression model.
The association we found between higher numbers of doctors per capita and PASC diagnosis or care underscores the importance of access to medical care. Given the disruption of medical care for both COVID and non-COVID illnesses during the pandemic, it is important to improve access to care, particularly for minorities [41]. Our findings of lower likelihood of PASC diagnosis among non-Hispanic Blacks support this hypothesis. The focus of this study was to investigate patient-level factors and therefore we did not consider several SDoH that can impact PASC risk such as essential worker status, financial issues, housing, and isolation. These are excellent candidate variables for future study [42]. Future research is also required to delineate the complex relationship of individual vs. contextual factors in the diagnosis and care for PASC. Policy measures such as strengthening primary care, optimizing SDoH data quality, and addressing SDoH are required to reduce inequalities in diagnosis and care for PASC [17].
The US Government Accountability Office estimates that between 7.7 and 23 million US adults have PASC [43]. Given the potential clinical and economic consequences, the US government has allocated over a billion dollars to study it [44]. Our study validates some findings of prior studies on PASC risk factors and provides novel information including the impact of SDoH. With the sample size available in N3C, we can evaluate more risk factors simultaneously than previous studies. Also, this study can be used to generate hypotheses about possible mechanisms and potential treatments for PASC. For example, because this study found that rheumatological conditions are a risk factor for PASC, future studies can assess whether treatment for rheumatological conditions can alter the likelihood of PASC diagnosis.
Our study has several limitations. First, the N3C only contains EHR data, which has inherent limitations and may encode biases related to health care access and racism [22]. To get complete and accurate information on PASC diagnosis, we restricted the cohort to health systems that used the ICD-10-CM code for PASC or had a long-COVID clinic visit at the time of the analysis. This limits the generalizability of our study findings to all health care systems within N3C or to the U.S. population, although it is likely that more U.S. health care systems now use the ICD-10-CM code as doctors and patients have gained understanding of PASC. Therefore, our findings on risk factors may generalize to the broader US population. Second, our definition for selecting individuals with PASC is narrow, as it only includes those who received a long-COVID diagnosis or visited a clinic for long-COVID. Therefore, it is likely that we missed individuals who had symptoms or conditions associated with long-COVID but did not receive a PASC diagnosis code or have not visited a long-COVID clinic. However, this should not affect our results because we included true positives and attempted to include true negatives to determine risk factors. Third, because identification of individuals without PASC (controls) is not straightforward without clear definitions or biomarkers, we used three approaches to identify controls. Two of those leveraged our CP classification model for long-COVID [23]. Importantly, however, model performance did not have clinically meaningful differences across different cohort selection methods. Fourth, further analysis is needed to determine the role of SDoH and how it impacts individual-level risk factors for PASC. While research shows that county-level SDoH variables can be significant for patient-level analysis, a more granular geographic unit or patient-level data would likely provide a greater understanding of the relationship between SDoH and PASC outcomes [45,46]. Fifth, we did not evaluate the role of vaccines and therapeutics such as Paxlovid in the likelihood of PASC diagnosis. Sixth, we did not evaluate the association between COVID-19 reinfection and PASC diagnosis or care. Seventh, we excluded children from this analysis because the burden and clinical features of COVID-19 may differ significantly between adults and children [47]. Eighth, our study numbers should not be used to estimate the prevalence of PASC in the general population, as the study only identifies individuals with a clinical diagnosis of PASC or long-COVID clinic visits. Ninth, there may be residual confounding in this study because we did not include all potential risk factors for PASC.

Conclusions
This national study using N3C data identified important risk factors for PASC diagnosis such as middle age, severe COVID-19 disease, and comorbidities. Further clinical and epidemiological research is needed to better understand underlying mechanisms and the potential role of vaccines and therapeutics in altering the course of PASC.

1) Unrestricted controls (Method 1): All individuals who were not identified as cases became eligible (n = 1,054,336).
2) Restricted controls (Method 2): We excluded individuals highly suspected of having long-COVID, defined as a predicted probability >= 0.75 based on the CP model of having a U09.9 diagnosis and having visited a long-COVID clinic. Overall, 621,374 individuals became eligible for controls.
3) More restricted controls (Method 3): We included individuals highly suspected of not having long-COVID (predicted probability <= 0.25) based on the CP model of having a U09.9 diagnosis and a long-COVID clinic visit. Overall, 496,073 individuals became eligible for controls.
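The three control pools defined above can be expressed as simple threshold rules on the CP model's predicted probability. The sketch below is illustrative: the function name and input layout are hypothetical, and excluding non-cases without a CP score from Methods 2 and 3 is an assumption based on the statement that only individuals with a predicted probability became eligible for those pools.

```python
def control_pools(cp_probability, high=0.75, low=0.25):
    """Build the three pools of eligible controls from non-case patients.

    cp_probability: {patient id: predicted probability, or None if the
    CP model produced no score for that patient}.
    """
    unrestricted = set(cp_probability)  # Method 1: every non-case
    scored = {pid: p for pid, p in cp_probability.items() if p is not None}
    # Method 2: drop those highly suspected of long-COVID (p >= 0.75).
    restricted = {pid for pid, p in scored.items() if p < high}
    # Method 3: keep only those highly suspected of not having it (p <= 0.25).
    more_restricted = {pid for pid, p in scored.items() if p <= low}
    return unrestricted, restricted, more_restricted
```

By construction, the Method 3 pool is a subset of the Method 2 pool, which in turn is a subset of the Method 1 pool, matching the decreasing counts of eligible controls reported above.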

Table 1 column headers: PASC (N = 8,325) b; Method 1, Unrestricted controls (N = 41,625); Method 2, Restricted controls (N = 41,610); Method 3, Most restricted controls (N = 41,610). First row group: Demographics.
Table 1 note: Comorbidities shown in this Table are selected. A comprehensive stratification by comorbidities is in the Supplement.